2022-04-07

Syllabus

  • Static word embeddings
    • Frequency based methods, word2vec, GloVe, fastText, evaluation of embeddings
  • Contextual word embeddings
    • ELMo, Transformers and attention, BERT, sentence embeddings, contrastive learning
  • Additional topics
    • Geometry of the embedding space, bias, sentiment, multilingual embeddings
  • Topological data analysis
    • Hyperbolic embeddings, singularities and topological polysemy

Motivation & Methods

Motivation: Winograd schemas

  • The trophy doesn’t fit into the brown suitcase because it’s too large.
  • The trophy doesn’t fit into the brown suitcase because it’s too small.


Task: Co-reference resolution

Motivation: Winograd schemas

  • The city councilmen refused the demonstrators a permit because they feared violence.
  • The city councilmen refused the demonstrators a permit because they advocated violence.


Task: Co-reference resolution

  • easy for humans to solve
  • difficult for computers
    • solution relies on real-world knowledge and common sense reasoning

Motivation: Winograd schemas

  • I put the cake away in the refrigerator. It has a lot of butter in it.
  • I put the cake away in the refrigerator. It has a lot of leftovers in it.

Motivation: Garden-path sentences

  • The old man the boat.

Motivation: Garden-path sentences

  • The complex houses married and single soldiers and their families.

Motivation: Garden-path sentences

  • The horse raced past the barn fell.

Methods

Some word vectors from a 100-dimensional fastText embedding trained on a Wikipedia corpus, projected to 2 dimensions using t-SNE.

Applications of word embeddings

  • Word-sense induction (WSI) or word-sense discrimination: task is the identification of the senses/meanings of a word
  • Output: clustering of contexts of the target word, or a clustering of words related to the target word

Example:

  • target word “cold”
  • collection of sentences:
    • “I caught a cold.”
    • “The weather is cold.”
    • “The ice cream is cold.”

Output: ?
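One possible output can be sketched by clustering the contexts of the target word by word overlap. This is a minimal illustration only (real WSI systems cluster embedding vectors, and the similarity threshold here is an arbitrary choice):

```python
def jaccard(a, b):
    # set overlap: |intersection| / |union|
    return len(a & b) / len(a | b)

def cluster_contexts(sentences, target, threshold=0.2):
    # represent each context by its word set, minus the target word itself
    contexts = [set(s.lower().replace(".", "").split()) - {target}
                for s in sentences]
    clusters = []
    for i, ctx in enumerate(contexts):
        for cluster in clusters:
            # join an existing cluster if similar enough to any member
            if any(jaccard(ctx, contexts[j]) >= threshold for j in cluster):
                cluster.append(i)
                break
        else:
            clusters.append([i])  # otherwise start a new sense cluster
    return clusters

sents = ["I caught a cold.", "The weather is cold.", "The ice cream is cold."]
print(cluster_contexts(sents, "cold"))  # → [[0], [1, 2]]
```

The illness context is separated from the two temperature contexts, which share words like "the" and "is".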

  • Word-sense disambiguation (WSD): relies on a predefined sense inventory, and the task is to solve the ambiguity in the context
  • Output: identifying which sense of a word is used in a sentence
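With a predefined sense inventory, a classic WSD baseline is the simplified Lesk algorithm: pick the sense whose dictionary gloss overlaps most with the context. The sense names and glosses below are invented for illustration:

```python
def simplified_lesk(context_words, senses):
    # senses: mapping from sense label to its dictionary gloss
    best, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(set(context_words) & set(gloss.split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

senses = {
    "cold_illness": "a common viral infection causing sneezing and sore throat",
    "cold_temperature": "having a low temperature like ice or winter weather",
}
print(simplified_lesk("the weather is cold like winter".split(), senses))
# → cold_temperature (gloss shares "weather", "like", "winter")
```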

Part-of-speech tagging

  • grammatical tagging: decide which part of speech (noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection) a word in a text corpus belongs to

The PoS tag may depend both on the word's definition and on its context

  • in natural language, a large portion of word forms is ambiguous
  • example from Wikipedia:
    • “dogs” usually is a plural noun,
    • but can also be a verb as in the sentence “The sailor dogs the hatch.”
  • example where order matters:
    • “can of fish”
    • “we can fish”
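A most-frequent-tag baseline makes the ambiguity concrete: it assigns each word its most common tag from training data and therefore cannot use context to disambiguate. The tiny corpus and tag names below are invented for illustration:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    # count how often each word form received each tag
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    # keep only the most frequent tag per word
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(sentence, model, default="NOUN"):
    return [(w, model.get(w.lower(), default)) for w in sentence]

corpus = [
    [("we", "PRON"), ("can", "AUX"), ("fish", "VERB")],
    [("a", "DET"), ("can", "NOUN"), ("of", "ADP"), ("fish", "NOUN")],
    [("the", "DET"), ("can", "NOUN"), ("fell", "VERB")],
]
model = train_baseline(corpus)
print(tag(["a", "can", "of", "fish"], model))
```

Because the baseline ignores word order, "can" and "fish" get the same tag in "we can fish" and "can of fish", which is exactly where context-sensitive taggers are needed.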

Sub-categories for PoS tagging:

  • for nouns, the plural, possessive, and singular forms can be distinguished.
  • “case” (role as subject, object, etc.), grammatical gender, and so on
  • verbs are marked for tense, aspect, and other things

Other tagging tasks:

Text classification

  • Document classification: spam / not spam

  • Review classification: positive / negative

  • Sentiment: positive / neutral / negative

  • single-label classification / multi-label classification

(https://lena-voita.github.io/nlp_course.html)

  • Generative models:
    • learn underlying data distribution \(P(x, y) = P(x | y) \cdot P(y)\)
    • prediction: given an input \(x\), pick a class with the highest joint probability \(y = \mathop{\mathrm{argmax}}_{k} P(x | y = k) \cdot P(y = k)\)
      • maximum a posteriori (MAP) estimate
  • Discriminative models:
    • learn the boundaries between classes (i.e. learn how to use the features)
    • prediction: given an input \(x\), pick a class with the highest conditional probability \(y = \mathop{\mathrm{argmax}}_{k} P(y = k | x)\)
      • maximum likelihood estimate (MLE) of parameters
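The generative prediction rule can be illustrated with toy numbers for a single binary feature; the priors and likelihoods below are made up:

```python
# Toy MAP prediction: priors P(y) and class-conditional likelihoods P(x | y)
# for one feature x ("the text contains the word 'free'"); numbers invented.
priors = {"spam": 0.3, "ham": 0.7}
likelihood = {("free", "spam"): 0.6, ("free", "ham"): 0.05}

def map_predict(x):
    # argmax over classes of P(x | y) * P(y)
    return max(priors, key=lambda y: likelihood[(x, y)] * priors[y])

print(map_predict("free"))  # spam: 0.6*0.3 = 0.18 beats ham: 0.05*0.7 = 0.035
```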

Bag of Words (BoW) assumption: word order does not matter
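Under the BoW assumption, two sentences with the same words in a different order receive identical representations, which the following sketch makes explicit:

```python
def bow_vector(text, vocab):
    # count word occurrences, ignoring their positions
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return [counts.get(w, 0) for w in vocab]

vocab = ["the", "dog", "bites", "man"]
print(bow_vector("the dog bites the man", vocab))  # [2, 1, 1, 1]
print(bow_vector("the man bites the dog", vocab))  # identical: order is lost
```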

Plan

Static word embeddings

Frequency based methods

term-document matrix:

  • each document is represented by a vector of word counts

term frequency – inverse document frequency (tf-idf):

  • sparse vectors
  • words are represented as a simple function of the counts of nearby words
  • raw counts are down-weighted for terms that occur in many documents
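The tf-idf weight of a term in a document can be computed directly from counts (this sketch uses the plain, unsmoothed variant; libraries often add smoothing):

```python
import math

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)          # term frequency in this document
    df = sum(term in d for d in docs)        # number of documents containing the term
    idf = math.log(len(docs) / df)           # inverse document frequency
    return tf * idf

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
print(tf_idf("cat", docs[0], docs))  # informative: appears in 2 of 3 documents
print(tf_idf("the", docs[0], docs))  # appears everywhere, so idf = log(1) = 0
```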

word2vec (Mikolov, Chen, et al. 2013; Mikolov, Sutskever, et al. 2013)

(Jurafsky and Martin 2009)

GloVe (Pennington, Socher, and Manning 2014)

(https://nlp.stanford.edu/projects/glove/)

fastText (Bojanowski et al. 2016)
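fastText represents a word as the sum of vectors for its character n-grams, which lets it build vectors for out-of-vocabulary words. The subword extraction step can be sketched as follows (n-gram range shortened here for readability; fastText's defaults are 3 to 6):

```python
def char_ngrams(word, n_min=3, n_max=6):
    # fastText wraps the word in boundary markers before extracting n-grams
    padded = "<" + word + ">"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.add(padded[i:i + n])
    grams.add(padded)  # the full word itself is also kept as a feature
    return grams

print(sorted(char_ngrams("where", 3, 4)))
```

Note that the boundary markers matter: the trigram "her" from "where" is distinct from the whole word "<her>".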

Contextual word embeddings

  • I’m going to the bank to withdraw some money.
  • We’re sitting on the river bank with some friends.

Recurrent methods: ELMo

Transformers

(Vaswani et al. 2017)
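The core operation of the Transformer, scaled dot-product attention softmax(QK^T / sqrt(d_k)) V, in a minimal dependency-free sketch (single head, no masking, toy matrices):

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 0.0]]                      # one query
K = [[1.0, 0.0], [0.0, 1.0]]          # two keys
V = [[1.0, 2.0], [3.0, 4.0]]          # two values
print(attention(Q, K, V))             # weighted mix, biased toward the first value
```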

Bidirectional Encoder Representations from Transformers (BERT)

Huggingface transformers

Sentence embeddings

Geometry of the embedding space

(Mimno and Thompson 2017)

(Nakashole and Flauger 2018)

Bias

(Bolukbasi et al. 2016)

Sentiment

(Yu et al. 2017)

Multilingual embeddings

Manifolds and topology

Hyperbolic embeddings

(Nickel and Kiela 2017)

  • Poincaré GloVe (Tifrea, Bécigneul, and Ganea 2018)
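The distance function used in the Poincaré ball model (Nickel and Kiela 2017) can be written down directly; note how distances blow up as points approach the boundary, which is what gives the model room for hierarchies:

```python
import math

def poincare_distance(u, v):
    # distance in the Poincaré ball: acosh(1 + 2|u-v|^2 / ((1-|u|^2)(1-|v|^2)))
    sq = lambda x: sum(xi * xi for xi in x)
    diff = [ui - vi for ui, vi in zip(u, v)]
    arg = 1 + 2 * sq(diff) / ((1 - sq(u)) * (1 - sq(v)))
    return math.acosh(arg)

origin = [0.0, 0.0]
print(poincare_distance(origin, [0.5, 0.0]))  # modest distance
print(poincare_distance(origin, [0.9, 0.0]))  # grows rapidly near the boundary
```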

Hyperbolic image embeddings (Khrulkov et al. 2019)

Singularities and Topological Data Analysis (TDA)

  • manifold hypothesis does not hold at all points of certain static word embeddings

(Jakubowski, Gasic, and Zibrowius 2020)

  • topological polysemy: count the number of “meanings” around a singularity

Thank you!

Organisation

  • Schedule:
  • Building 25.12 / Room 2512.02.33 (& live stream available online)
  • Each week: talks by students (1 or 2 speakers per session, 70 minutes in total)
    • there should be enough time for questions and a discussion
  • Guest lecture “Multilingual embeddings” on 2022-06-30
  • The final grade is based on your presentation
  • Hand in your extended abstract (ideally .tex, .bib files and compiled .pdf; maximum 2 pages with references) via ILIAS

References

Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomás Mikolov. 2016. “Enriching Word Vectors with Subword Information.” CoRR abs/1607.04606. http://arxiv.org/abs/1607.04606.

Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Kalai. 2016. “Man Is to Computer Programmer as Woman Is to Homemaker? Debiasing Word Embeddings.” CoRR abs/1607.06520. http://arxiv.org/abs/1607.06520.

Conneau, Alexis, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. “Word Translation Without Parallel Data.” CoRR abs/1710.04087. http://arxiv.org/abs/1710.04087.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” CoRR abs/1810.04805. http://arxiv.org/abs/1810.04805.

Jakubowski, Alexander, Milica Gasic, and Marcus Zibrowius. 2020. “Topology of Word Embeddings: Singularities Reflect Polysemy.” In Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, 103–13. https://arxiv.org/abs/2011.09413.

Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing. 2nd ed. Upper Saddle River, NJ: Pearson Prentice Hall.

Khrulkov, Valentin, Leyla Mirvakhabova, Evgeniya Ustinova, Ivan V. Oseledets, and Victor S. Lempitsky. 2019. “Hyperbolic Image Embeddings.” CoRR abs/1904.02239. http://arxiv.org/abs/1904.02239.

Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. 2015. “Bilingual Word Representations with Monolingual Quality in Mind.” In NAACL Workshop on Vector Space Modeling for NLP. Denver, United States.

Mikolov, Tomás, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, edited by Yoshua Bengio and Yann LeCun. http://arxiv.org/abs/1301.3781.

Mikolov, Tomás, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” CoRR abs/1310.4546. http://arxiv.org/abs/1310.4546.

Mimno, David, and Laure Thompson. 2017. “The Strange Geometry of Skip-Gram with Negative Sampling.” In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2873–78. Copenhagen, Denmark: Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1308.

Nakashole, Ndapa, and Raphael Flauger. 2018. “Characterizing Departures from Linearity in Word Translation.” In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 221–27. Melbourne, Australia: Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-2036.

Nickel, Maximilian, and Douwe Kiela. 2017. “Poincaré Embeddings for Learning Hierarchical Representations.” CoRR abs/1705.08039. http://arxiv.org/abs/1705.08039.

Pennington, Jeffrey, Richard Socher, and Christopher Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–43. Doha, Qatar: Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1162.

Reimers, Nils, and Iryna Gurevych. 2019. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” CoRR abs/1908.10084. http://arxiv.org/abs/1908.10084.

Tifrea, Alexandru, Gary Bécigneul, and Octavian-Eugen Ganea. 2018. “Poincaré GloVe: Hyperbolic Word Embeddings.” CoRR abs/1810.06546. http://arxiv.org/abs/1810.06546.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” CoRR abs/1706.03762. http://arxiv.org/abs/1706.03762.

Yu, Liang-Chih, Jin Wang, K. Robert Lai, and Xuejie Zhang. 2017. “Refining Word Embeddings for Sentiment Analysis.” In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 534–39. Copenhagen, Denmark: Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1056.